Lecture 6: Column transformer and text features

Varada Kolhatkar

Announcements

  • Lecture recordings for the first two weeks have been made available.
  • HW3 is due next week Monday, Oct 1st, 11:59 pm.
    • You can work in pairs for this assignment.

Recap: Preprocessing mistakes

Data

# make synthetic data
X, y = make_blobs(n_samples=100, centers=3, random_state=12, cluster_std=5)
# split it into training and test sets
X_train_toy, X_test_toy, y_train_toy, y_test_toy = train_test_split(
    X, y, random_state=5, test_size=0.4)
plt.scatter(X_train_toy[:, 0], X_train_toy[:, 1], label="Training set", s=60)
plt.scatter(
    X_test_toy[:, 0], X_test_toy[:, 1], color=mglearn.cm2(1), label="Test set", s=60
)
plt.legend(loc="upper right")

Recap: Bad methodology 1

  • What’s wrong with scaling data separately?
scaler = StandardScaler()
scaler.fit(X_train_toy)
train_scaled = scaler.transform(X_train_toy)

scaler = StandardScaler()  # Creating a separate object for scaling test data
scaler.fit(X_test_toy)  # Calling fit on the test data
test_scaled = scaler.transform(
    X_test_toy
)  # Transforming the test data using the scaler fit on test data

knn = KNeighborsClassifier()
knn.fit(train_scaled, y_train_toy)
print(f"Training score: {knn.score(train_scaled, y_train_toy):.2f}")
print(f"Test score: {knn.score(test_scaled, y_test_toy):.2f}")
Training score: 0.63
Test score: 0.60

Scaling train and test data separately
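For contrast, a minimal sketch of the correct approach on the same kind of synthetic data: fit the scaler on the training set only and reuse that fitted scaler for the test set.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_blobs(n_samples=100, centers=3, random_state=12, cluster_std=5)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=5, test_size=0.4)

scaler = StandardScaler()
scaler.fit(X_train)                       # fit on training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuse the SAME scaler on test data

knn = KNeighborsClassifier()
knn.fit(X_train_scaled, y_train)
print(f"Test score: {knn.score(X_test_scaled, y_test):.2f}")
```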

Recap: Bad methodology 2

  • What’s wrong with scaling the data together?
# join the train and test sets back together
XX = np.vstack((X_train_toy, X_test_toy))

scaler = StandardScaler()
scaler.fit(XX)
XX_scaled = scaler.transform(XX)

XX_train = XX_scaled[:X_train_toy.shape[0]]
XX_test = XX_scaled[X_train_toy.shape[0]:]

knn = KNeighborsClassifier()
knn.fit(XX_train, y_train_toy)
print(f"Training score: {knn.score(XX_train, y_train_toy):.2f}")  # Misleading score
print(f"Test score: {knn.score(XX_test, y_test_toy):.2f}")  # Misleading score
Training score: 0.63
Test score: 0.55

Bad methodology 3 (class discussion)

  • What’s wrong here?
knn = KNeighborsClassifier()

scaler = StandardScaler()
scaler.fit(X_train_toy)
X_train_scaled = scaler.transform(X_train_toy)
X_test_scaled = scaler.transform(X_test_toy)
scores = cross_validate(knn, X_train_scaled, y_train_toy, return_train_score=True)
pd.DataFrame(scores)
fit_time score_time test_score train_score
0 0.000282 0.001294 0.250000 0.687500
1 0.000190 0.000635 0.500000 0.625000
2 0.000177 0.000614 0.583333 0.541667
3 0.000174 0.000607 0.583333 0.604167
4 0.000177 0.000611 0.416667 0.604167
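The fix, sketched below: put the scaler and the model in a pipeline and pass the pipeline itself to cross_validate, so the scaler is re-fit on each fold's training portion and the validation fold never influences the scaling statistics.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_blobs(n_samples=100, centers=3, random_state=12, cluster_std=5)
X_train_toy, X_test_toy, y_train_toy, y_test_toy = train_test_split(
    X, y, random_state=5, test_size=0.4)

# Inside cross_validate, the pipeline re-fits StandardScaler on each
# training fold only, so there is no leakage into the validation fold.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
scores = cross_validate(pipe, X_train_toy, y_train_toy, return_train_score=True)
print(scores["test_score"].mean())
```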

Improper preprocessing

Proper preprocessing

Recap: sklearn Pipelines

  • A Pipeline chains multiple steps (e.g., preprocessing + model fitting) into a single workflow.
  • Simplifies the code and improves readability.
  • Reduces the risk of data leakage by ensuring the training and test sets are transformed properly.
  • Automatically applies the transformations in sequence.

Example:

Chaining a StandardScaler with a KNeighborsClassifier model.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

Class demo

sklearn’s ColumnTransformer

  • Use ColumnTransformer to combine all of our transformations into one object.
  • Use a column transformer together with sklearn pipelines.
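A minimal sketch of this idea on a toy DataFrame (the column names here are made up for illustration): scale the numeric column, one-hot encode the categorical column, and chain the resulting transformer with a classifier.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (hypothetical feature names, for illustration only)
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "city": ["Vancouver", "Toronto", "Vancouver", "Montreal", "Toronto", "Montreal"],
    "target": [0, 1, 0, 1, 1, 0],
})
X, y = df.drop(columns=["target"]), df["target"]

# One object that applies the right transformation to the right columns
preprocessor = make_column_transformer(
    (StandardScaler(), ["age"]),                         # numeric features
    (OneHotEncoder(handle_unknown="ignore"), ["city"]),  # categorical features
)

pipe = make_pipeline(preprocessor, KNeighborsClassifier(n_neighbors=3))
pipe.fit(X, y)
print(pipe.predict(X))
```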

Ordinal encoding vs. One-hot encoding

  • Ordinal Encoding: Encodes categorical features as an integer array.
  • One-hot Encoding: Creates binary columns for each category’s presence.
  • Sometimes how we encode a specific feature depends upon the context.

Ordinal encoding vs. One-hot encoding

  • Consider a weather feature and its four levels: ['Sunny', 'Cloudy', 'Rainy', 'Snowy']
  • Predicting traffic volume: Using one-hot encoding would make sense here because the impact of different weather conditions on traffic volume does not necessarily follow a clear order and different weather conditions could have very distinct effects.
  • Predicting severity of weather-related road incidents: An ordinal encoding might be more appropriate if you define your weather categories from least to most severe as this could correlate directly with the likelihood or severity of incidents.

handle_unknown = "ignore" of OneHotEncoder

  • Use handle_unknown='ignore' with OneHotEncoder so that categories unseen during fit are encoded as all zeros during transform, instead of raising an error.
  • Is this a good approach in all scenarios?
from sklearn.preprocessing import OneHotEncoder

OneHotEncoder(handle_unknown='ignore')

drop="if_binary" argument of OneHotEncoder

  • The drop='if_binary' argument of OneHotEncoder reduces redundancy by dropping one of the two columns when a feature is binary.

Categorical variables with too many categories

  • Strategies for categorical variables with too many categories:
  • Dimensionality reduction techniques
  • Bucketing infrequent categories into an 'other' category
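The bucketing strategy can be sketched in a few lines of pandas (the threshold and data here are made up for illustration): keep categories that occur often enough and lump the rest into a single "other" bucket. scikit-learn's OneHotEncoder also offers a `min_frequency` parameter (version 1.1+) that does something similar internally.

```python
import pandas as pd

s = pd.Series(["red", "blue", "red", "green", "red", "blue",
               "mauve", "teal", "red"])

# Keep only categories that appear at least `min_count` times;
# lump everything else into a single "other" bucket.
min_count = 2
counts = s.value_counts()
frequent = counts[counts >= min_count].index
bucketed = s.where(s.isin(frequent), "other")
print(bucketed.value_counts())
```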

Dealing with text features

  • Preprocess text for machine learning models using text vectorization.
  • Bag of words representation

sklearn CountVectorizer

  • Use scikit-learn’s CountVectorizer to encode text data

  • CountVectorizer: Transforms text into a matrix of token counts

  • Important parameters:
    • max_df, min_df: Control document-frequency thresholds.
    • ngram_range: Defines the range of n-grams to extract.

Incorporating text features in a machine learning pipeline

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline

text_pipeline = make_pipeline(
    CountVectorizer(),
    SVC()
)

(iClicker) Exercise 6.1

iClicker cloud join link: https://join.iclicker.com/VYFJ

Select all of the following statements which are TRUE.

    1. You could carry out cross-validation by passing a ColumnTransformer object to cross_validate.
    2. After applying column transformer, the order of the columns in the transformed data has to be the same as the order of the columns in the original data.
    3. After applying a column transformer, the transformed data is always going to be of different shape than the original data.
    4. When you call fit_transform on a ColumnTransformer object, you get a numpy ndarray.

(iClicker) Exercise 6.2

iClicker cloud join link: https://join.iclicker.com/VYFJ

Select all of the following statements which are TRUE.

    1. handle_unknown="ignore" would treat all unknown categories equally.
    2. As you increase the value for the max_features hyperparameter of CountVectorizer, the training score is likely to go up.
    3. Suppose you are encoding text data using CountVectorizer. If you encounter a word in the validation or the test split that’s not available in the training data, we’ll get an error.
    4. In the code below, inside cross_validate, each fold might have a slightly different number of features (columns) in the fold.
pipe = make_pipeline(CountVectorizer(), SVC())
cross_validate(pipe, X_train, y_train)